For a large class of applications, the standard RAG architecture is illegal. Not slow, not expensive, illegal. The moment you embed a user's medical history, a company's proprietary code, or regulated PII and ship those vectors to a cloud vector database, you have moved sensitive data across a boundary that GDPR, HIPAA, or an enterprise data-residency policy says it cannot cross. Embeddings are not anonymisation; they are a lossy but reversible-enough representation of the source.

The usual answers are to sign a business associate agreement, self-host the vector DB in a compliant region, or scrub the data first. Sometimes those work. Often the only answer the compliance team will accept is the strongest one: the data never leaves the device at all. That is the architecture I built for a privacy-first wellness app, and here is how on-device RAG works without crippling the phone.

1. The privacy wall, precisely

The compliance objection is not vague nervousness; it is specific. Cloud RAG creates at least three crossings of the trust boundary: the raw text leaves the device to be embedded (unless you embed locally), the vectors are stored on third-party infrastructure, and the retrieved context is sent to a hosted LLM. Each crossing is a place an auditor will stop you. Zero-cloud means collapsing all three: embed on device, store on device, and either generate on device or send out only the minimum, user-consented context.

AES-256: database encrypted at rest via SQLCipher, key held in hardware-backed secure storage.
On-device: both lexical (FTS5) and semantic (quantized embeddings) retrieval run locally. No vector DB call.
0 crossings: sensitive text and its embeddings never leave the device unencrypted.

2. Encryption at rest, anchored in hardware

The foundation is an encrypted database. SQLCipher wraps SQLite with transparent AES-256 encryption: every page is encrypted on disk, and the database is unreadable without the key. The discipline that makes this real rather than theatre is where the key lives. It is never hard-coded and never stored in plain application storage. It lives in the platform's hardware-backed secure storage (the iOS Keychain, the Android Keystore), reached through expo-secure-store, and is released to the app only after device authentication.

Pattern: open an encrypted store with a key from hardware-backed secure storage

```typescript
import * as SecureStore from 'expo-secure-store';

// Key is generated once, sealed in the hardware-backed keystore,
// and never written to ordinary storage or sent anywhere.
const key = await SecureStore.getItemAsync('db_key', {
  requireAuthentication: true,
});
// getItemAsync resolves to null when no value is stored under this name.
if (!key) throw new Error('Database key not found in secure storage');

// `db` is an already-opened SQLCipher connection.
// SQLCipher applies the key; every page is AES-256 encrypted at rest.
db.exec(`PRAGMA key = "x'${key}'"`);
db.exec(`PRAGMA cipher_memory_security = ON`);
```

If the device is lost, the database on disk is ciphertext and the key is sealed in hardware that resists extraction. That is the property the compliance team actually wants: not "we trust our servers," but "the data is unreadable without this specific authenticated device."
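The raw-key form of `PRAGMA key` interpolates the stored value directly into SQL, so it can be worth validating the key's shape first. The following is a minimal sketch, assuming the key is a 256-bit value stored as 64 hex characters; `toHex` and `buildKeyPragma` are hypothetical helper names, not part of SQLCipher or expo-secure-store, and the random bytes are assumed to come from whatever secure random source the platform provides.

```typescript
// Hypothetical helpers for illustration; not part of any library.

/** Encode raw key bytes as lowercase hex, two characters per byte. */
function toHex(bytes: Uint8Array): string {
  let hex = "";
  for (let i = 0; i < bytes.length; i++) {
    hex += (bytes[i] < 16 ? "0" : "") + bytes[i].toString(16);
  }
  return hex;
}

/**
 * Build the raw-key PRAGMA statement for a 256-bit key.
 * Anything that is not exactly 64 hex characters is rejected, so a corrupted
 * or tampered stored value is never interpolated into SQL.
 */
function buildKeyPragma(hexKey: string): string {
  if (!/^[0-9a-fA-F]{64}$/.test(hexKey)) {
    throw new Error("Database key must be 64 hex characters (32 bytes)");
  }
  return `PRAGMA key = "x'${hexKey}'"`;
}
```

On first launch, the app would generate 32 random bytes, pass them through `toHex`, and store the result in secure storage; on every later launch it would read the value back and pass it through `buildKeyPragma` before handing the statement to the database.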

3. Two retrievers, both local

On-device retrieval has to do the same job as cloud hybrid search (catch exact terms and semantic matches) without a server. The answer is the same two-retriever shape, implemented with what the phone already has.

Lexical: SQLite FTS5. SQLite ships a full-text search engine, FTS5, that indexes content for fast keyword and phrase queries entirely in-process. Its built-in ranking is BM25, so it is the on-device counterpart of the lexical retriever in a cloud hybrid search, and it is essentially free because the database is already SQLite:

SQL: an FTS5 virtual table gives local lexical search

```sql
CREATE VIRTUAL TABLE notes_fts USING fts5(content, tokenize = 'porter');

-- Fast, local, no network: ranked keyword retrieval over encrypted content.
SELECT rowid, rank
FROM notes_fts
WHERE notes_fts MATCH ?
ORDER BY rank
LIMIT 50;
```
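The `MATCH ?` parameter is parsed as an FTS5 query expression, so raw user input containing quotes or operator words such as AND or NOT can fail to parse or be read as operators. One way to handle this, sketched here under the assumption that every typed word should be treated as a literal term, is to quote each term before binding it. `toFtsQuery` is a hypothetical helper name.

```typescript
/**
 * Turn free-form user input into an FTS5 MATCH expression of literal terms.
 * Each whitespace-separated term becomes a double-quoted FTS5 string, with any
 * embedded double quote doubled. Terms are joined by spaces, which FTS5 treats
 * as an implicit AND. Returns "" for blank input; the caller should skip the
 * query in that case.
 */
function toFtsQuery(input: string): string {
  return input
    .trim()
    .split(/\s+/)
    .filter((term) => term.length > 0)
    .map((term) => `"${term.replace(/"/g, '""')}"`)
    .join(" ");
}
```

The returned string is what gets bound to the `?` placeholder; the tokenizer configured on the table still applies to each quoted term.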

Semantic: a quantized embedding model. The semantic half runs a small, quantized embedding model on-device. Quantization (int8 or 4-bit weights instead of float32) shrinks the model enough to load and run on a phone without draining the battery or stalling the UI, at a modest and acceptable cost to embedding quality. Each note is embedded once at write time; the vectors are stored in the same encrypted database and compared with cosine similarity at query time. No embedding API call, no vector-DB round trip.
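For a per-user corpus, the query-time comparison can be a plain linear scan. The sketch below assumes the stored vectors have already been read from the database and decoded into `Float32Array` values; the `StoredVector` shape and the `topK` name are illustrative, not taken from the app.

```typescript
/** Cosine similarity between two equal-length vectors; 0 if either is all zeros. */
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("Vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical row shape: a note id plus its decoded embedding.
interface StoredVector {
  id: number;
  vector: Float32Array;
}

/** Score every stored vector against the query and keep the k best, highest first. */
function topK(
  query: Float32Array,
  rows: StoredVector[],
  k: number
): { id: number; score: number }[] {
  return rows
    .map((row) => ({ id: row.id, score: cosineSimilarity(query, row.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A brute-force scan like this costs one pass over the corpus per query, which is usually acceptable for a few thousand notes; a much larger corpus would likely need an approximate index instead.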

Fuse the two ranked lists the same way cloud RAG does, with Reciprocal Rank Fusion, and you have on-device hybrid retrieval: exact-term recall from FTS5, semantic recall from local embeddings, combined without either touching the network.
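Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores, which is why it combines BM25 ranks and cosine similarities without any score normalisation. A minimal sketch follows, assuming each retriever returns note ids ordered best first; the function name is illustrative and `k = 60` is the commonly used default, to be treated as tunable.

```typescript
/**
 * Reciprocal Rank Fusion over any number of ranked id lists.
 * Each id scores the sum of 1 / (k + rank) across the lists it appears in,
 * where rank starts at 1. Returns ids ordered by fused score, highest first.
 */
function reciprocalRankFusion(rankedLists: number[][], k: number = 60): number[] {
  const scores = new Map<number, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  const fused: { id: number; score: number }[] = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  fused.sort((a, b) => b.score - a.score);
  return fused.map((entry) => entry.id);
}
```

Called as `reciprocalRankFusion([lexicalIds, semanticIds])`, it rewards notes that both retrievers rank well over notes that only one retriever ranks first.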

4. The tradeoffs, stated honestly

On-device RAG is not a free lunch, and pretending otherwise is how you ship something that dies in the App Store reviews:

When to reach for zero-cloud

Choose on-device RAG when the data is regulated, personal, or proprietary, when the corpus is per-user rather than global, and when "the data never leaves the device" is a feature you can sell. Choose cloud RAG when the corpus is large, shared, and not sensitive. Most apps are one or the other; a few need both, partitioned by data class.

5. The stack

The implementation is a Next.js 15 web app and an Expo 54 React Native mobile app sharing logic, with SQLCipher for the encrypted store, expo-secure-store for hardware-backed key custody, SQLite FTS5 for lexical retrieval, and a quantized on-device embedding model for semantic retrieval. It is the same hybrid-plus-fusion retrieval shape I use in my cloud RAG engine, rebuilt to run entirely within the trust boundary of a single authenticated phone.

What I Built

This architecture powers the privacy-first retrieval in WellnessInYou, a full-stack wellness platform where user data is personal by definition and on-device by design. It is the same conviction behind the rest of my work: the impressive part of an AI system is rarely the model; it is the engineering around it that makes the system trustworthy enough to put in front of real users with real data.